Abstract. We study infinite stochastic games played by n players on
a finite graph with goals given by sets of infinite traces. The games are
stochastic (each player simultaneously and independently chooses an
action at each round, and the next state is determined by a probability
distribution depending on the current state and the chosen actions),
infinite (the game continues for an infinite number of rounds), nonzero sum
(the players' goals are not necessarily conflicting), and undiscounted. We
show that if each player has a reachability objective, that is, if the goal
for each player $i$ is to visit some subset $R_i$ of the states, then there exists
an $\varepsilon$-Nash equilibrium in memoryless strategies. We study the
complexity of finding such Nash equilibria. Given an n-player reachability game,
and a vector of values $(v_1, \ldots, v_n)$, we show it is NP-hard to determine
if there exists a memoryless $\varepsilon$-Nash equilibrium where each player $i$ gets
payoff at least $v_i$. On the other hand, for every fixed $\varepsilon$, the value can be
$\varepsilon$-approximated in FNP.

We study two important special cases of the general problem. First, we
study n-player turn-based probabilistic games, where at each state at most
one player has a nontrivial choice of moves. For turn-based probabilistic
games, we show the existence of $\varepsilon$-Nash equilibria in pure strategies for
all games where the goal of each player is a Borel set of infinite traces.
We also derive the existence of pure exact Nash equilibria for n-player
turn-based games where each player has an $\omega$-regular objective.

Then we study the two-player case and show that already for two-player
games exact Nash equilibria may not exist. Our techniques for the general
case also yield NP $\cap$ coNP $\varepsilon$-approximation algorithms for zero-sum
reachability games, improving the previously known EXPTIME bound.


1  Introduction 

The  interaction  of  several  agents  is  naturally  modeled  as  non-cooperative  games 
[25,27].  The  simplest,  and  most  common  interpretation  of  a  non-cooperative 
game  is  that  there  is  a  single  interaction  among  the  players  (“one-shot”),  after 
which  the  payoffs  are  decided  and  the  game  ends.  However,  many,  if  not  all, 
strategic endeavors occur over time, and in a stateful manner. That is, the games
Polish KBN grant 7-T11C-027-20, and the EU RTN HPRN-CT-2002-00283.


progress  over  time,  and  the  current  game  is  decided  based  on  the  history  of 
the  interactions.  Infinite  stochastic  games  [30,8]  form  a  natural  model  for  such 
interactions.  A  stochastic  game  is  played  over  a  state  space,  and  is  played  in 
rounds.  In  each  round,  each  player  chooses  an  available  action  simultaneously 
with  and  independently  from  all  other  players,  and  the  game  moves  to  a  new 
state  under  a  possibly  probabilistic  transition  relation  based  on  the  current  state 
and  the  joint  actions.  For  the  verification  and  control  of  reactive  systems,  such 
games  are  infinite:  play  continues  for  an  infinite  number  of  rounds,  giving  rise 
to  an  infinite  sequence  of  states,  called  the  outcome  of  the  game.  The  players 
receive  a  payoff  based  on  a  payoff  function  mapping  infinite  outcomes  to  a  real 
in  [0, 1]. 

Payoffs are generally Borel measurable functions [23]. For example, the payoff
set for each player $i$ is a Borel set $B_i$ in the Cantor topology on $S^\omega$ (where $S$ is the
set of states), and player $i$ gets payoff 1 if the outcome of the game is a member of
$B_i$, and 0 otherwise. In verification, payoff functions are usually index sets of
$\omega$-regular languages. $\omega$-regular sets occur in low levels of the Borel hierarchy (they
are in $\Sigma_3 \cap \Pi_3$), but they form a robust and expressive language for determining
payoffs for commonly used specifications [20]. The simplest $\omega$-regular games
correspond to safety (closed sets) or reachability (open sets) objectives.

Traditionally, automata theory and verification have considered zero sum or
strictly competitive versions of stochastic games. In these games there are two
players with complementary objectives, so the payoff for one is one minus the
payoff of the other. We argue that in many modeling instances this is too
pessimistic an assumption. The environment of a component in a distributed system
is not necessarily malicious. In fact, many natural interactions are modeled as
a game between several components each with its own specification, and each
component is interested solely in establishing its own specification without
regard to the specification of other components. For example, consider a set of n
processors  each  sending  out  data  on  a  common  network.  At  each  round,  each 
process  can  decide  to  send  data  or  do  nothing.  If  more  than  one  process  tries 
to  send  data  simultaneously,  then  there  is  a  conflict  and  the  data  is  not  sent;  if 
there  is  a  unique  processor  sending  out  data  in  a  round,  then  its  data  is  sent  out. 
The  game  for  two  processors  is  schematically  shown  in  Figure  1.  It  can  be  easily 
generalized  for  n  processors.  Each  processor  wishes  to  send  an  infinite  number 
of  data  packets,  that  is,  process  i  has  the  specification  that  the  game  visits  the 
node  i  infinitely  often. 

Traditionally, the system will be modeled as a zero sum game between process
$i$ and an environment consisting of all other processes, and the requirement will
be specified in a game logic such as alternating-time temporal logic [1] as
$\langle\langle i \rangle\rangle\, \Box\Diamond\, i$,
that is, we ask if player $i$ has a strategy to visit node $i$ infinitely often, against all
strategies of the other players. This condition is too restrictive, and indeed, this
cannot be proved for the network game (consider a strategy of the environment
where all the other processors try to send at each round). We claim that the
right way to model this system is as a non-zero sum game, where each processor
$i$ has the obligation $\Box\Diamond\, i$, and is solely interested in ensuring its specification
(_,_): denotes neither of the processes sends packets.

(1,2): denotes both processes send packets.

(1,_): denotes process 1 sends packets and process 2 does not.

(_,2): denotes process 1 does not send packets and process 2 does.

Fig. 1. The two processor game


without  regard  to  the  specifications  of  the  other  players.  The  solution  concept 
in  such  games  is  a  Nash  equilibrium  [13],  that  is,  a  strategy  profile  such  that  no 
processor  can  gain  by  deviating  from  the  profile,  assuming  all  other  processors 
continue  playing  their  strategies  in  the  profile.  However,  the  existence  of  Nash 
equilibria  in  infinite  games  is  not  clear. 

Notice that the network game has a Nash equilibrium where the processors
are allocated time slots, and processor $i$ only sends in time slot $k$ where
$k \bmod n = i$. Indeed this is a solution adopted in time triggered protocols in real
time systems [16]. There is a (symmetric) equilibrium in the game as well: each
processor rolls an n-sided die, and sends data only if the die shows 1. Then
with probability 1, all processors can send data infinitely often. Interestingly,
the exponential backoff behavior implemented in real networks also has the above
property; indeed, it is a Nash equilibrium where the strategy of a player is
oblivious to the total number of processes participating in the game. The emergence
of quite rich behavior in such a simple example shows the modeling power of
stochastic games.
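
As a rough numerical illustration of the symmetric profile described above, the following sketch computes the per-round probability that a fixed processor is the unique sender when each of n processors independently sends with probability 1/n. The function name is hypothetical and chosen only for this illustration; the formula (1/n)(1 - 1/n)^(n-1) follows directly from independence of the die rolls.

```python
from fractions import Fraction

def unique_sender_probability(n: int) -> Fraction:
    """Probability that one fixed processor is the only sender in a round,
    assuming each of n processors sends independently with probability 1/n."""
    if n < 1:
        raise ValueError("n must be at least 1")
    p = Fraction(1, n)
    # The fixed processor sends (p) and the other n - 1 stay silent (1 - p each).
    return p * (1 - p) ** (n - 1)
```

Because this probability is strictly positive and the rounds are independent, each processor is the unique sender infinitely often with probability 1, which matches the claim in the text.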

This work is motivated by a result of Secchi and Sudderth [29], who
proved that a Nash equilibrium exists for safety conditions, where
each player $i$ has a subset of states $S_i$ as their safe states and gets payoff 1 if the
play never leaves the set $S_i$, and payoff 0 otherwise. In the open (or reachability)
game, each player $i$ has a subset of states $R_i$ as reachability targets. Player $i$ gets
payoff 1 if the outcome visits some state from $R_i$ at some point, and 0 otherwise.
Our main results on reachability games are summarised below.

1. We show that reachability games on finite state spaces always have $\varepsilon$-Nash
equilibria in memoryless strategies. This is the best one can hope for: there
are two-player non-zero sum reachability games with no Nash equilibria [17].
In general, equilibrium strategies require randomization.

2. We show that the problem of finding an $\varepsilon$-Nash equilibrium in memoryless
strategies where each player gets at least some (specified) payoff is NP-hard.
Related NP-hardness results appear in [5], but our results do not follow
from theirs as our payoffs are restricted to be binary. Moreover, for any
constant $\varepsilon$, we give an NP algorithm to approximate the value of some
$\varepsilon$-Nash equilibrium in memoryless strategies. Already for two-person zero-sum
games with reachability objective, values can be algebraic and there are
simple examples where they are irrational [8]. Hence approximating the values
is the best one can achieve.

Together  with  [29]  this  solves  the  existence  question  for  the  lowest  level  of  the 
Borel  hierarchy.  We  leave  the  existence  of  Nash  equilibria  in  stochastic  games 
where  objectives  are  sets  in  higher  levels  of  the  Borel  hierarchy  as  an  interesting 
open  question.  We  also  study  two  important  special  cases  of  the  general  problem: 
turn-based  probabilistic  games  [33,28],  where  at  each  stage,  at  most  one  player 
has  a  nontrivial  choice  of  actions,  and  the  two-player  case  of  general  stochastic 
games. 

For the special case of two-person turn-based probabilistic zero sum games
we prove a pure strategy determinacy theorem for all Borel payoff functions. The
proof is a specialization of Martin's determinacy proof for stochastic games with
Borel payoffs [23]. Using this, and a general construction of threat strategies [25],
we show that $\varepsilon$-Nash equilibria exist for all turn-based probabilistic games with
arbitrary Borel set payoffs. Moreover, using further structural properties for
turn-based probabilistic parity games, we show the existence of pure strategy
Nash equilibria for parity payoffs. Since parity games are a canonical form for
$\omega$-regular properties [31], this proves that (exact) Nash equilibria exist for
turn-based probabilistic games with $\omega$-regular payoffs. Using an NP $\cap$ co-NP strategy
construction algorithm for parity games [3], we get an NP algorithm to find a
Nash equilibrium in these games.

For the special case of two-player (concurrent) games, we show an improved
NP $\cap$ co-NP upper bound to approximate the values for two-person zero sum
reachability games within $\varepsilon$-tolerance for any constant $\varepsilon$, improving the
previously best known EXPTIME upper bound [8]. This generalizes a result of
Condon [4]. Notice that the solution of a zero-sum reachability game can be
irrational, hence we can only hope to compute it to an $\varepsilon$-precision.


Related  Work 

Stochastic games were introduced by Shapley [30] and have been extensively
studied in several research communities; the book of Filar and Vrieze [10]
provides a unified treatment of the theories of stochastic games and Markov decision
processes. Existence of Nash equilibria in (nonzero sum) discounted stochastic
games was proved by Fink [11]. Since then, several results have appeared for
special cases [32,33]. One of the most important results in stochastic games in
recent times is due to Vieille [34,35], who shows the existence of $\varepsilon$-Nash
equilibria for two-person non-zero sum games with limit average criteria. The
existence of Nash equilibria for n-person stochastic games with limit average
criteria is still open. Our result shows that in the special case of turn-based
probabilistic n-person games $\varepsilon$-Nash equilibria exist, since limit average criteria occur
in low levels of the Borel hierarchy.

Infinite games with Borel winning conditions have been studied by
descriptive set theorists [15]. Martin [23] proved the determinacy result for two-person
stochastic zero sum games with Borel payoff, building on his earlier proof of
Borel determinacy in perfect information games [22]. This result was extended
by Maitra and Sudderth [19]. In the case of non-zero sum games the existence
of Nash equilibria for Borel payoffs remains one of the most important
questions in stochastic games. Secchi and Sudderth [29] showed the existence of Nash
equilibria with safety conditions.

Computing the values of a Nash equilibrium, when it exists, is another
challenging problem [26,36]. Recently, [5] showed hardness of several such questions.
Condon [4] studied two-person turn-based probabilistic discounted games with
reachability objective and showed that the values at a state can be computed
in NP $\cap$ co-NP. Her result can be applied to show that the values of
two-person turn-based probabilistic zero-sum games with reachability objective can
be computed in NP $\cap$ co-NP. We show that for the general case of two-person
(concurrent) zero-sum games with reachability objective, values can be
approximated in NP $\cap$ co-NP. For zero-sum stochastic games with $\omega$-regular objectives,
[8] gives doubly exponential algorithms, and [3] gives more efficient algorithms
for the turn-based case.


2  Definitions 

An n-person stochastic game $G$ consists of a finite, nonempty set of states $S$,
$n$ players $1, 2, \ldots, n$, finite action sets $A_1, A_2, \ldots, A_n$ for the players, a
conditional probability distribution $p$ on $S \times (A_1 \times A_2 \times \cdots \times A_n)$ called the law
of motion, and bounded, real valued payoff functions $\phi_1, \phi_2, \ldots, \phi_n$ defined on
the history space $H = S \times A \times S \times A \times \cdots$, where $A = A_1 \times A_2 \times \cdots \times A_n$. The game
is called an n-player deterministic game if for all states $s \in S$ and action choices
$a = (a_1, a_2, \ldots, a_n)$ there is a unique state $s'$ such that $p(s' \mid s, a) = 1$.

Play begins at an initial state $s_0 = s \in S$. Each player $i$ independently and
concurrently selects a mixed action $a_i^1$ with a probability distribution $\sigma_i(s)$
belonging to $\mathcal{P}(A_i)$, the set of probability measures on $A_i$. Given $s_0$ and the chosen
mixed actions $a^1 = (a_1^1, a_2^1, \ldots, a_n^1) \in A$, the next state $s_1$ has the probability
distribution $p(\cdot \mid s_0, a^1)$. Then again each player $i$ independently selects $a_i^2$ with a
distribution $\sigma_i((s_0, s_1))$ and given $a^2 = (a_1^2, a_2^2, \ldots, a_n^2)$, the next state $s_2$ has
the probability distribution $p(\cdot \mid s_1, a^2)$. Play continues in this fashion, thereby
generating a random history $h = (s_0, a^1, s_1, a^2, \ldots) \in H$. Note that the game
continues for an infinite number of steps [9], and the payoff is decided based
on the infinite outcome. This is useful to model interactions between reactive
systems [21].

A function $\pi_i$ that specifies for each partial history $h' = (s_0, a^1, s_1, a^2, \ldots, s_k)$
the conditional distribution $\pi_i(h') \in \mathcal{P}(A_i)$ for player $i$'s next action
is called a strategy for player $i$. A strategy profile $\pi = (\pi_1, \pi_2, \ldots, \pi_n)$
consists of a strategy $\pi_i$ for each player $i$. A selector for a player $i$ is a mapping
$\sigma_i : S \to \mathcal{P}(A_i)$. A selector profile $\sigma = (\sigma_1, \sigma_2, \ldots, \sigma_n)$ consists of a
selector $\sigma_i$ for each player $i$. The memoryless/stationary strategy $\sigma_i^\infty$ for player $i$
is the strategy which chooses mixed action $\sigma_i(s')$ each time the play visits $s'$.
A strategy profile $\sigma^\infty = (\sigma_1^\infty, \sigma_2^\infty, \ldots, \sigma_n^\infty)$ is a memoryless strategy profile if
all the strategies $\sigma_1^\infty, \sigma_2^\infty, \ldots, \sigma_n^\infty$ are memoryless. Given a memoryless
strategy profile $\sigma^\infty = (\sigma_1^\infty, \sigma_2^\infty, \ldots, \sigma_n^\infty)$, we write $\sigma = (\sigma_1, \sigma_2, \ldots, \sigma_n)$ to denote the
corresponding selector profile for the players. An initial state $s$ and a strategy
profile $\pi = (\pi_1, \pi_2, \ldots, \pi_n)$ together with the law of motion $p$ determine a
probability distribution $P_{s,\pi}$ on the history space. We write $E_{s,\pi}$ for the expectation
operator associated with $P_{s,\pi}$.
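
The round-by-round dynamics described above can be sketched in code for the special case where every player uses a selector (a memoryless strategy). This is only an illustrative sketch: the dictionary encoding of the law of motion, and the names `sample_from` and `play_prefix`, are assumptions made here and do not come from the paper.

```python
import random

def sample_from(dist, rng):
    """Draw one outcome from a dict {outcome: probability}."""
    r = rng.random()
    acc = 0.0
    for outcome, prob in dist.items():
        acc += prob
        if r < acc:
            return outcome
    return outcome  # guard against floating-point round-off

def play_prefix(law_of_motion, selectors, s0, rounds, rng):
    """Generate the finite prefix (s0, a1, s1, ..., s_rounds) of a history.

    Hypothetical encoding:
      law_of_motion[(state, joint_action)] = {next_state: probability}
      selectors[i](state) = {action: probability} for player i
    """
    history = [s0]
    state = s0
    for _ in range(rounds):
        # Players choose simultaneously and independently of one another.
        joint = tuple(sample_from(sel(state), rng) for sel in selectors)
        state = sample_from(law_of_motion[(state, joint)], rng)
        history.extend([joint, state])
    return history
```

A general strategy would receive the whole partial history rather than only the current state; the sketch restricts itself to selectors to stay short.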

Assume now that the payoff functions $\phi_i : H \to \mathbb{R}$ are bounded and
measurable, where $\mathbb{R}$ is the set of reals. If the initial state of the game is $s$ and each
player $i$ chooses a strategy $\pi_i$, then the payoff to each player $i$ is the expectation
$E_{s,\pi}\,\phi_i$, where $\pi$ is the strategy profile $\pi = (\pi_1, \pi_2, \ldots, \pi_n)$.

For $\varepsilon \ge 0$, an $\varepsilon$-equilibrium at the initial state $s$ is a profile $\pi =
(\pi_1, \pi_2, \ldots, \pi_n)$ such that, for all $i = 1, 2, \ldots, n$,

$$E_{s,\pi}\,\phi_i \;\ge\; \sup_{\mu_i} E_{s,(\pi_1,\ldots,\pi_{i-1},\mu_i,\pi_{i+1},\ldots,\pi_n)}\,\phi_i \;-\; \varepsilon,$$

where $\mu_i$ ranges over the set of all strategies for player $i$. In other words, each $\pi_i$
guarantees an expected payoff for player $i$ which is within $\varepsilon$ of the best possible
expected payoff for player $i$ when every other player $j \ne i$ plays $\pi_j$. A
0-equilibrium is called a Nash equilibrium and for every $\varepsilon > 0$ an $\varepsilon$-equilibrium is
called an $\varepsilon$-Nash equilibrium [13]. A strategy profile $\pi$ for an $\varepsilon$-Nash equilibrium
is referred to as the $\varepsilon$-equilibrium profile. Similarly, a strategy profile $\pi$ for a Nash
equilibrium is referred to as the Nash equilibrium profile.

Let $r_i : S \to \mathbb{R}$ be a daily reward function for player $i$, $i = 1, 2, \ldots, n$. It is
known that Nash equilibria exist for some interesting payoff functions such as a
discounted payoff

$$\phi_i(h) = \sum_{k=0}^{\infty} \beta^k r_i(s_k), \qquad 0 < \beta < 1$$

(cf. Mertens and Parthasarathy [24] and the references there), but need not exist
for other payoff functions such as an average reward

$$\phi_i(h) = \limsup_{n \to \infty} \frac{1}{n} \sum_{k=0}^{n-1} r_i(s_k)$$

even for a two-person, zero-sum game with finite state space (cf. Gillette [12],
Blackwell and Ferguson [2] for a famous counterexample and Vieille [34] and [35]
for the existence of "equilibrium payoffs" in two-person, non-zero sum games).
A game with a total reward objective is a game with payoff for player $i$ ($\phi_i^T$)
defined as

$$\phi_i^T(h) = \sum_{k=0}^{\infty} r_i(s_k)$$

which assigns to a history a payoff that is the sum total of the rewards of the
states.

Secchi and Sudderth [29] proved that Nash equilibria exist for safety
conditions where each player $i$ has a subset of states $S_i$ as their safe states and gets
payoff 1 if the play never leaves the set $S_i$, and payoff 0 otherwise. That is, let
$S_1^\infty, S_2^\infty, \ldots, S_n^\infty$ be the subsets of $H$ defined by

$$S_i^\infty = \{h = (s_0, a^1, s_1, a^2, \ldots) : s_k \in S_i \text{ for all } k = 0, 1, \ldots\}$$

and take the payoff function $\phi_i$ to be the indicator function of $S_i^\infty$ for $i =
1, 2, \ldots, n$. The problem of Nash equilibria for reachability objectives was left
open. In this work we show that for every positive $\varepsilon$ there is an $\varepsilon$-Nash equilibrium in
n-player stochastic games with reachability objectives. We now formally define
the payoff functions we study. Let $R_1, R_2, \ldots, R_n$
be subsets of the state space $S$. The subset of states $R_i$ is referred to as the target
set for player $i$. Then let $R_1^\infty, R_2^\infty, \ldots, R_n^\infty$ be the subsets of $H$ defined by

$$R_i^\infty = \{h = (s_0, a^1, s_1, a^2, \ldots) : \exists k,\ s_k \in R_i\}$$

and take the payoff function $\phi_i$ to be the indicator function of $R_i^\infty$ for $i =
1, 2, \ldots, n$. Thus each player $i$ receives a payoff of 1 if the process of states $s_0, s_1, \ldots$
reaches a state in $R_i$, and receives payoff 0 otherwise. We call stochastic games
with payoff functions of this form reach-a-set games.
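
The two indicator payoffs can be illustrated on a finite prefix of states. The sketch below is an assumption-laden illustration (the function names are hypothetical): the true payoffs are defined on infinite histories, so a prefix can only confirm that a target has been reached, or that a safe set has been left.

```python
def reaches(states, target):
    """True if some state of the finite prefix lies in the target set R_i.
    A True answer already fixes the reachability payoff at 1."""
    return any(s in target for s in states)

def stays(states, safe):
    """True if every state of the finite prefix lies in the safe set S_i.
    A False answer already fixes the safety payoff at 0."""
    return all(s in safe for s in states)
```

This asymmetry reflects the topological remark made earlier in the text: reachability objectives are open sets (witnessed by a finite prefix), while safety objectives are closed sets (refuted by a finite prefix).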


3  Existence  of  e-Nash  Equilibria 

We define some more notation which we will use in our proofs below.
Given a strategy profile $\tau = (\tau_1, \tau_2, \ldots, \tau_n)$, the strategy profile $\tau_{-i} =
(\tau_1, \ldots, \tau_{i-1}, \tau_{i+1}, \ldots, \tau_n)$ is the strategy profile obtained by deleting the
strategy $\tau_i$ from $\tau$, whereas for any strategy $\mu_i$ of player $i$, $(\tau_{-i}, \mu_i) =
(\tau_1, \ldots, \tau_{i-1}, \mu_i, \tau_{i+1}, \ldots, \tau_n)$ denotes the strategy profile where player $i$ follows
$\mu_i$ and the other players follow the strategies of $\tau_{-i}$. Similar definitions hold for
selector profiles as well. The main result of this section is the existence of $\varepsilon$-Nash
equilibria.

Theorem 1 ($\varepsilon$-equilibrium). An n-person reach-a-set game $G$ with a finite
state space has an $\varepsilon$-Nash equilibrium at every initial state $s \in S$ for every
positive $\varepsilon$. Moreover, there is a memoryless $\varepsilon$-equilibrium strategy profile.

As Example 1 below shows, even for 2-player games with reachability
objectives a Nash equilibrium need not exist. Hence an $\varepsilon$-Nash equilibrium is the best one
can achieve for n-person reach-a-set games.
Example 1. [$\varepsilon$-equilibrium] Consider the following game, adapted from [9,17].
The state space of the game is $S = \{s, t, u\}$. The action set for player 1 in state
$s$ is $\{a, b\}$ and for player 2 is $\{c, d\}$. The states $t, u$ are absorbing states in the
sense that when the process of states reaches $t$ or $u$ it stays there for ever. The game
has a deterministic law of motion $p$ as follows:

$$p(s \mid s, a, c) = p(t \mid s, a, d) = p(t \mid s, b, c) = p(u \mid s, b, d) = 1.$$

The target set for player 1 is $\{t\}$ and for player 2 is $\{u\}$. For every $\varepsilon > 0$ player 1
can choose moves $a$, $b$ with probability $1 - \varepsilon$ and $\varepsilon$ respectively to ensure reaching
the state $t$ with probability at least $1 - \varepsilon$ from $s$. However, player 1 has no strategy
to reach $t$ with probability 1: if player 1 decides to play move $b$ at the n-th round
of the game, player 2 can play move $d$ at the n-th round, so that the probability
of reaching $t$ is always less than 1. ■
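
The bound in Example 1 can be checked numerically under one simplifying assumption that is not part of the example itself: player 2 is restricted to stationary replies, playing c with a fixed probability q at every visit to s. The function name below is hypothetical. From s the play stays in s with probability (1 - eps) q and moves to t with probability (1 - eps)(1 - q) + eps q, so the probability of eventually reaching t is the ratio computed below.

```python
def reach_t_probability(eps: float, q: float) -> float:
    """Probability of eventually reaching t from s in Example 1 when
    player 1 plays a with probability 1 - eps and b with probability eps,
    and player 2 plays c with probability q and d with probability 1 - q.
    Assumes 0 < eps <= 1 and 0 <= q <= 1 (so the loop probability is < 1)."""
    stay = (1 - eps) * q                   # joint action (a, c): remain in s
    to_t = (1 - eps) * (1 - q) + eps * q   # (a, d) or (b, c): move to t
    # Absorption probability of a chain that loops on s with probability `stay`.
    return to_t / (1 - stay)
```

Over a grid of values of q the result never drops below 1 - eps, and the minimum is attained at q = 0 (player 2 always plays d), consistent with the text.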

Definition 1 ($\beta$-discounted games). Given an n-player game $G$ we use $G^\beta$ to
denote a $\beta$-discounted version of the game $G$. The game $G^\beta$ at each step halts
with probability $\beta$ (goes to a special sink state halt which has a reward 0 for
every player) and continues as the game $G$ with probability $1 - \beta$. $\beta$ is called the
discount-factor. ■
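
A minimal sketch of the discounting construction, assuming a law of motion encoded as a dictionary from (state, joint action) pairs to next-state distributions. The encoding, the function name `discount`, and the convention of keying the absorbing halt state by `(halt, None)` are illustrative assumptions, not notation from the paper.

```python
def discount(law_of_motion, beta, halt="halt"):
    """Return the law of motion of the beta-discounted game.

    From every original (state, joint action) pair, the play moves to `halt`
    with probability beta and otherwise follows the original distribution,
    scaled by 1 - beta. Assumes `halt` is not already a state of the game."""
    discounted = {}
    for (state, joint), dist in law_of_motion.items():
        new_dist = {nxt: (1 - beta) * p for nxt, p in dist.items()}
        new_dist[halt] = beta
        discounted[(state, joint)] = new_dist
    # The halt state is absorbing (and carries reward 0 for every player).
    discounted[(halt, None)] = {halt: 1.0}
    return discounted
```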

Definition 2 (Markov Decision Process (MDP) reach-a-set game). A
Markov Decision Process (MDP) is a 1-player stochastic game. An MDP
reach-a-set game is a 1-player stochastic reach-a-set game. ■

Definition 3 (Values of MDP). Given an MDP reach-a-set game $G$, the value
of the game at state $s$ is denoted by

$$v(s) = \sup_{\pi} E_{s,\pi}\,\phi$$

where $\pi$ ranges over all strategies and $\phi$ is the reach-a-set game payoff for the
player in the game $G$. Similarly, we use

$$v^{\beta}(s) = \sup_{\pi} E^{\beta}_{s,\pi}\,\phi$$

to denote the value at state $s$ in the game $G^\beta$, where $G^\beta$ is the $\beta$-discounted
version of the game $G$ and $E^{\beta}_{s,\pi}$ is the expectation taken in $G^\beta$. In a similar way,
given an MDP $G_T$ with a total reward objective we use the following notation

$$v_T(s) = \sup_{\pi} E_{s,\pi}\,\phi^T.$$

Also, $v_T^{\beta}(s)$ denotes the value at state $s$ for the game $G_T^{\beta}$, which is the $\beta$-discounted
version of the game $G_T$. ■

Lemma 1. Let $G$ be an MDP reach-a-set game and $G^\beta$ be the $\beta$-discounted
version of $G$. Then for all states $s \in S$ we have

$$v(s) - v^{\beta}(s) \le \beta.$$
Proof. Given an MDP reach-a-set game $G$, let $R \subseteq S$ be the target set for the
player. We construct a total reward game $G_T$ as follows:

— State Space: $S_T = S \cup \{\mathit{sink}\}$.

— Reward Function: $r(s) = 1$ if $s \in R$, else 0.

— Law of Motion: For all $s \in S \setminus R$, we have $p_T(s' \mid s) = p(s' \mid s)$ and for all
$s \in R \cup \{\mathit{sink}\}$ we have $p_T(\mathit{sink} \mid s) = 1$.

That is, in the total reward game $G_T$, defined on the same state space $S$ with
a special sink state, from every state in $R$ the game goes to the sink state and
stays there for ever. The reward is 1 for every state in $R$ and 0 elsewhere. Let
$G_T^\beta$ be the $\beta$-discounted version of the game $G_T$. It is easy to notice that for all
states $s$ we have $v(s) = v_T(s)$ and $v^\beta(s) = v_T^\beta(s)$. It follows from the continuity
of the values of MDPs with total reward objective with respect to $\beta$ for positive total
reward (Theorem 4.4.1, p. 197, Filar-Vrieze [10]) and the Lipschitz continuity of
the values of MDPs with total reward objective (Theorem 4.3.7, p. 185,
Filar-Vrieze [10]) that for all $s \in S$ we have $v_T(s) - v_T^\beta(s) \le \beta$. The required result
follows. ■
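
The construction of the total reward game used in the proof can be sketched as follows. The dictionary encoding of an MDP (with explicit actions, which the proof leaves implicit) and the name `to_total_reward` are assumptions made for illustration only.

```python
def to_total_reward(law, states, target, sink="sink"):
    """Build the total reward MDP G_T from a reach-a-set MDP.

    Hypothetical encoding: law[(s, a)] = {next_state: probability}.
    Returns (law_T, reward). Assumes `sink` is not already a state."""
    law_T = {}
    for (s, a), dist in law.items():
        if s in target:
            law_T[(s, a)] = {sink: 1.0}   # target states move to the sink
        else:
            law_T[(s, a)] = dict(dist)    # other states keep the original law
    law_T[(sink, None)] = {sink: 1.0}     # the sink is absorbing
    reward = {s: (1 if s in target else 0) for s in states}
    reward[sink] = 0
    return law_T, reward
```

Because every target state leads straight to the zero-reward sink, a play collects reward 1 at most once, namely on its first visit to the target set; this is why the total reward of a history coincides with its reachability payoff.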

Definition 4 (Stopping time of history in $\beta$-discounted games).
Consider the stopping time $T$ defined on histories $h = (s_0, a^1, s_1, a^2, \ldots)$ by

$$T(h) = \inf\{k \ge 0 : s_k = \mathit{halt}\}$$

where as usual the infimum of the empty set is $+\infty$. ■

Lemma 2. Let $G^\beta$ be an n-player $\beta$-discounted stochastic game. Then, for all
initial states $s$ and all strategy profiles $\pi$ we have

$$P_{s,\pi}[T > m] \le (1 - \beta)^m.$$

Proof. At each step of the game $G^\beta$ the game reaches the halt state with
probability $\beta$. Hence the probability of not reaching the halt state in $m$ steps is
$\le (1 - \beta)^m$. ■
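
The geometric bound of Lemma 2 makes it easy to estimate how many rounds are needed before the game has halted with high probability. The helper names below are hypothetical, and the sketch only evaluates the bound; it does not simulate a game.

```python
def survival_bound(beta: float, m: int) -> float:
    """Upper bound (1 - beta)**m on the probability that T > m."""
    return (1 - beta) ** m

def rounds_for_tail(beta: float, delta: float) -> int:
    """Smallest m >= 0 such that (1 - beta)**m <= delta.
    Assumes 0 < beta <= 1 and delta > 0, so the loop terminates."""
    m, tail = 0, 1.0
    while tail > delta:
        tail *= 1 - beta
        m += 1
    return m
```

For instance, with beta = 0.5 and delta = 0.25 the function returns 2, since (1/2)^2 = 1/4.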

The proof of the next lemma is similar to the proof of Lemma 2.2 for
stay-in-a-set games of Secchi and Sudderth [29].

Lemma 3. There exist selectors $\sigma_i : S \to \mathcal{P}(A_i)$, $i = 1, 2, \ldots, n$, such that the
memoryless profile $\sigma^\infty = (\sigma_1^\infty, \sigma_2^\infty, \ldots, \sigma_n^\infty)$ is a Nash equilibrium profile in $G^\beta$
for every $s \in S$.

Proof. Regard each n-tuple $\sigma = (\sigma_1, \sigma_2, \ldots, \sigma_n)$ of selectors as a vector in a
compact, convex subset $K$ of the appropriate Euclidean space. Then define a
correspondence $\Lambda$ that maps each element $\sigma$ of $K$ to the set $\Lambda(\sigma)$ of all elements
$g = (g_1, g_2, \ldots, g_n)$ of $K$ such that, for $i = 1, 2, \ldots, n$ and all $s \in S$, $g_i^\infty$ is an
optimal response for player $i$ in $G^\beta$ against $\sigma_{-i}^\infty = (\sigma_1^\infty, \ldots, \sigma_{i-1}^\infty, \sigma_{i+1}^\infty, \ldots, \sigma_n^\infty)$.

Clearly, it suffices to show that there is a $\sigma \in K$ such that $\sigma \in \Lambda(\sigma)$. To show
this, we will verify the conditions of Kakutani's Fixed Point Theorem [14]:

1. For every $\sigma \in K$, $\Lambda(\sigma)$ is closed, convex and nonempty;

2. If, for $k = 1, 2, \ldots$, $g^{(k)} \in \Lambda(\sigma^{(k)})$, $\lim_{k \to \infty} g^{(k)} = g$ and $\lim_{k \to \infty} \sigma^{(k)} = \sigma$,
then $g \in \Lambda(\sigma)$.

To verify condition 1, fix $\sigma = (\sigma_1, \sigma_2, \ldots, \sigma_n) \in K$ and $i \in \{1, 2, \ldots, n\}$. For
each $s \in S$, let $v(s)$ be the maximal payoff that player $i$ can achieve in $G^\beta$
against $\sigma_{-i}^\infty$. Since fixing the strategies of all the other players turns the game into
an MDP, we know that $g_i^\infty$ is an optimal response to $\sigma_{-i}^\infty$ if and only if, for each
$s \in S$, $g_i(s)$ puts positive probability only on actions $a_i \in A_i$ that maximize the
expectation of $v$, namely,

$$\sum_{s'} v(s')\, p(s' \mid s, (a_i, \sigma_{-i}(s))).$$

Hence condition 1 follows easily.

Condition 2 is an easy consequence of the continuity of the mapping

$$\sigma \mapsto E_{s,\sigma^\infty}\,\phi_i$$

from $K$ to the real line. It follows from Lemma 2 that the mapping is continuous.


Definition 5 (Memoryless strategy profile, MDP and Markov Chains).
Given an n-player stochastic game $G$ let $\sigma^\infty = (\sigma_1^\infty, \sigma_2^\infty, \ldots, \sigma_n^\infty)$ be a memoryless
strategy profile and $\sigma = (\sigma_1, \sigma_2, \ldots, \sigma_n)$ be the corresponding selector profile.
Then the game $G_\sigma$ is a Markov chain where the law of motion is defined
by the functions in the selector profile $\sigma$ and the law of motion $p$ of the game $G$.
Similarly, $G_{\sigma_{-i}}$ is a Markov decision process where the mixed action of each
player $j \ne i$ at a state $s$ is fixed according to the selector function $\sigma_j(s)$. The law
of motion $p_{\sigma_{-i}}$ of the MDP is determined by the selectors in $\sigma_{-i}$ and the law of
motion $p$ of $G$. ■
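
To make Definition 5 concrete, the following sketch computes the transition probabilities of the Markov chain induced by a selector profile. The dictionary encoding and the name `induced_chain` are assumptions introduced here for illustration; the computation simply averages the law of motion over the product distribution of the players' independent mixed actions.

```python
from itertools import product

def induced_chain(law_of_motion, selectors, states):
    """Transition probabilities of the Markov chain induced by a selector profile.

    Hypothetical encoding:
      law_of_motion[(s, joint_action)] = {next_state: probability}
      selectors[i][s] = {action: probability} for player i at state s
    Returns chain[s] = {next_state: probability}."""
    chain = {}
    for s in states:
        row = {}
        supports = [sel[s].items() for sel in selectors]
        for choice in product(*supports):
            joint = tuple(action for action, _ in choice)
            weight = 1.0
            for _, prob in choice:
                weight *= prob  # players randomize independently
            for nxt, q in law_of_motion[(s, joint)].items():
                row[nxt] = row.get(nxt, 0.0) + weight * q
        chain[s] = row
    return chain
```

Fixing the selectors of all players except player i and leaving that player's action free would yield the Markov decision process of the definition in the same way.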

Lemma 4. Given an n-player stochastic game $G$ and a positive $\varepsilon$ there is a
memoryless profile $\sigma^\infty = (\sigma_1^\infty, \sigma_2^\infty, \ldots, \sigma_n^\infty)$ such that $\sigma^\infty$ is an $\varepsilon$-Nash equilibrium
profile in $G$.

Proof. Given the game $G$ we construct a game $G^\varepsilon$ which is a discounted version
of $G$ with discount-factor $\varepsilon$. It follows from Lemma 3 that there is a memoryless
strategy profile $\sigma^\infty$ in $G^\varepsilon$ such that $\sigma^\infty$ is a Nash equilibrium profile in the
game $G^\varepsilon$. We show that the profile $\sigma^\infty$ is an $\varepsilon$-equilibrium profile for $G$. Let
$\sigma = (\sigma_1, \sigma_2, \ldots, \sigma_n)$ be the selector functions corresponding to the strategy profile
$\sigma^\infty = (\sigma_1^\infty, \sigma_2^\infty, \ldots, \sigma_n^\infty)$. Consider any player $i$ and the strategy profile $\sigma_{-i}^\infty$. The
game $G_{\sigma_{-i}}$ is an MDP where the mixed actions of all the other players are fixed
according to $\sigma_{-i}$. Also, $G^\varepsilon_{\sigma_{-i}}$ is the MDP which is the $\varepsilon$-discounted version
of the game $G_{\sigma_{-i}}$. It follows from Lemma 1 that $\sigma^\infty$ is an $\varepsilon$-equilibrium profile
in the game $G$. ■

Lemma  4  yields  Theorem  1. 
4  Complexity  of  computing  equilibrium  values

Let $\pi$ be an $\varepsilon$-equilibrium profile. Then, the value at a state $s$ for a player $i$ for the
equilibrium profile $\pi$, denoted $v_i^\pi(s)$, is $E_{s,\pi}\,\phi_i$. The value of an $\varepsilon$-equilibrium
profile $\pi$ at a state $s$ is the value vector $v^\pi(s) = (v_1^\pi(s), v_2^\pi(s), \ldots, v_n^\pi(s))$. Our
main results about the computational complexity of computing the value of any
$\varepsilon$-equilibrium profile within a tolerance of $\varepsilon$ are summarized below.

Theorem 2 (Computing values of a memoryless equilibrium profile).
For an n-player deterministic reach-a-set game $G$, an initial state $s$, and a value
vector $v = (v_1, v_2, \ldots, v_n)$, it is NP-hard to determine whether there is a Nash
equilibrium profile $\pi$ such that the value of every player $i$ from the state $s$ under
the profile $\pi$, i.e. $v_i^{\pi}(s)$, is greater than or equal to $v_i$. Given a fixed $\varepsilon$, there is an
NP algorithm to decide whether there is an $\varepsilon$-equilibrium profile $\pi$ in memoryless
strategies such that for all players $i$ we have $v_i^{\pi}(s) \geq v_i - \varepsilon$.

4.1  Reduction  of  3-SAT  to  computing  equilibrium  values 

We first prove that it is NP-hard to compute a memoryless Nash equilibrium profile
of n-player deterministic reach-a-set games, by reduction from 3-SAT. Given
a 3-SAT formula $\varphi$ with $n$ clauses and $m$ variables we construct an
n-player deterministic reach-a-set game $G_{\varphi}$. Let the variables in the formula $\varphi$
be $x_1, x_2, \ldots, x_m$ and the clauses be $C_1, C_2, \ldots, C_n$. In the game $G_{\varphi}$ each clause
is a player. The state space $S$, the law of motion, and the target states are defined
as follows:

- State space:

  $S = \{1, 2, \ldots, m, m+1, (1,0), (1,1), (2,0), (2,1), \ldots, (i,0), (i,1), \ldots, (m,0), (m,1), \mathit{sink}\}$.

- Law of motion: From the states $(i,0)$ and $(i,1)$ the game always moves to the state
  $i+1$. Let $C_i = \{C_{i_1}, C_{i_2}, \ldots, C_{i_k}\}$ be the set of clauses in which the variable $x_i$
  occurs. Then, in state $i$, players $i_1, i_2, \ldots, i_k$ have a choice of moves in
  $\{0, 1\}$. If all these players choose move 0 the game proceeds to state $(i,0)$; if
  all of them choose move 1 the game proceeds to state $(i,1)$; otherwise the game
  goes to the sink state. Once the game reaches the sink state or the state
  $m+1$ it remains there forever.

- Target states: The target sets for the players are defined as follows.
  Let $C_i^0 = \{C_{k_1}, \ldots, C_{k_l}\}$ be the set of clauses that are satisfied by assigning
  $x_i = 0$; then the state $(i,0)$ is a target state for players $k_1, k_2, \ldots, k_l$. Similarly,
  let $C_i^1 = \{C_{k'_1}, \ldots, C_{k'_{l'}}\}$ be the set of clauses that are satisfied by
  assigning $x_i = 1$; then the state $(i,1)$ is a target state for players
  $k'_1, k'_2, \ldots, k'_{l'}$. The states $1, 2, \ldots, m+1$ and the sink state are not target states
  for any player.

The game is illustrated in Figure 2.

Lemma 5 (NP-hardness). Consider an n-player reach-a-set game $G$, an initial
state $s$, and a value vector $v = (v_1, v_2, \ldots, v_n)$. It is NP-hard to determine
whether there is a Nash equilibrium profile such that the value of each player $i$
at state $s$ is at least $v_i$.

Proof. We reduce the 3-SAT problem to the problem of determining whether
there is an equilibrium in which each player has a value $\geq v_i$ at state $s$. Given
a 3-SAT formula $\varphi$ we construct the game $G_{\varphi}$ as described above, and take $s = 1$
and $v_i = 1$ for every player $i$. We claim that there is a Nash equilibrium in which every
player gets value 1 at state 1 iff the formula $\varphi$ is satisfiable. If the formula is satisfiable,
consider a satisfying assignment to the variables. At each state $i$ all
the players choose the move specified by the satisfying assignment, and hence
every player gets payoff 1; since 1 is the maximum possible payoff, no player can gain by
deviating, so this profile is a Nash equilibrium. Conversely, if all the players get payoff 1 in
the game $G_{\varphi}$, it follows from the construction of $G_{\varphi}$ that there is an assignment such that every
clause is satisfied, and hence the 3-SAT formula $\varphi$ is satisfiable. ∎
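The reduction can be checked on small formulas with a short simulation. The following is a minimal sketch, not part of the original construction: it assumes clauses are given as tuples of signed integers (literal $x_i$ as `i`, its negation as `-i`), that every variable occurs in some clause, and that players use pure memoryless strategies. The function names `targets`, `payoffs`, `follow` and `satisfiable_via_game` are introduced here for illustration only.

```python
from itertools import product

def targets(clauses, m):
    # targets[(i, b)]: players (clause indices) for whom state (i, b) is a
    # target state, i.e. clauses satisfied by assigning x_i = b.
    t = {(i, b): set() for i in range(1, m + 1) for b in (0, 1)}
    for j, clause in enumerate(clauses):
        for lit in clause:
            t[(abs(lit), 1 if lit > 0 else 0)].add(j)
    return t

def payoffs(clauses, m, moves):
    # moves[j][i] is the move in {0, 1} that player j picks at state i.
    t = targets(clauses, m)
    winners = set()
    for i in range(1, m + 1):
        movers = [j for j, c in enumerate(clauses) if any(abs(l) == i for l in c)]
        chosen = {moves[j][i] for j in movers}
        if len(chosen) > 1:
            break                            # disagreement: the play enters the sink
        if chosen:
            winners |= t[(i, chosen.pop())]  # visit (i, b), then continue to i + 1
    return [1 if j in winners else 0 for j in range(len(clauses))]

def follow(assignment, clauses):
    # Pure memoryless profile in which every player copies the assignment.
    return [dict(assignment) for _ in clauses]

def satisfiable_via_game(clauses, m):
    # Brute force over assignments; exponential, for illustration only.
    for bits in product((0, 1), repeat=m):
        a = {i + 1: bits[i] for i in range(m)}
        if all(payoffs(clauses, m, follow(a, clauses))):
            return True
    return False
```

For the formula $(x_1 \vee \neg x_2 \vee x_3) \wedge (\neg x_1 \vee x_2 \vee x_3)$, encoded as `[(1, -2, 3), (-1, 2, 3)]`, the profile that follows $x_1 = 1, x_2 = 1, x_3 = 0$ gives both players payoff 1, while any disagreement at state 1 sends the play to the sink and gives both players payoff 0.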

The Nash equilibrium condition in memoryless strategies can be written as
a sentence in the first-order theory of the reals with addition and multiplication,
$(\mathbb{R}, +, \cdot)$. The length of the sentence is polynomial in the size of the game and
the depth of the quantifiers is constant. This gives an EXPTIME procedure
for the following decision problem: given a game $G$ and a value vector
$v = (v_1, v_2, \ldots, v_n)$, is there an $\varepsilon$-Nash equilibrium in memoryless strategies
such that each player $i$ gets payoff $\geq v_i$? Notice that the reduction to the theory
of reals with addition and multiplication allows us to solve other problems in a
similar way. For example, given a game $G$, the largest payoff $v_i$ for which there is an
$\varepsilon$-Nash equilibrium in memoryless strategies giving player $i$ a payoff of at least $v_i$ can be
located to within $\varepsilon$ in time exponential in the size of the game and polynomial in
$\log(1/\varepsilon)$, using binary search in the interval $[0, 1]$.
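The binary search mentioned above can be sketched as follows. This is only an illustration under stated assumptions: `exists_equilibrium_with` is a hypothetical decision oracle (standing in for the EXPTIME procedure over the theory of reals), assumed to be monotone, that is, if it accepts a threshold it accepts every smaller one.

```python
import math

def approx_max_payoff(exists_equilibrium_with, eps):
    # exists_equilibrium_with(v): hypothetical oracle answering whether some
    # memoryless equilibrium gives the chosen player payoff at least v.
    lo, hi = 0.0, 1.0
    steps = math.ceil(math.log2(1.0 / eps))  # number of oracle calls
    for _ in range(steps):
        mid = (lo + hi) / 2
        if exists_equilibrium_with(mid):
            lo = mid      # threshold mid is achievable, search above it
        else:
            hi = mid      # threshold mid is not achievable, search below it
    return lo, steps
```

Each iteration halves the interval, so the number of oracle calls grows like $\log(1/\varepsilon)$, which is where the polynomial dependence on $\log(1/\varepsilon)$ comes from.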

Since the number of Nash equilibria where each player gets a payoff 1 is
exactly the number of satisfying assignments, the following corollary is immediate.

Corollary  1.  Counting  the  number  of  Nash  equilibria  in  reachability  games 
where  each  player  gets  at  least  a  given  payoff  is  #P-hard. 

4.2  Approximating  equilibrium  value  in  NP 

We show that a memoryless $\varepsilon$-equilibrium profile can be approximated by
a k-uniform memoryless strategy profile, using a result of Lipton et al. [18].
Given an n-player stochastic reach-a-set game $G$, we use $|S|$ to denote the size of
the state space and $l$ to denote the maximum number of moves available to any
player at any state of $G$.

Definition 6 (Pure selector). A selector function $\sigma_i$ for player $i$ is pure if for
all states $s \in S$ there is an action $a_i \in A_i$ such that $\sigma_i(a_i \mid s) = 1$. ∎

Definition 7 (k-uniform selector and k-uniform memoryless strategy).
A selector function $\sigma_i^k$ for player $i$ is a k-uniform selector if for all states $s \in S$
we have that $\sigma_i^k(s)$ is the uniform distribution on a multiset $M$ of pure selectors with
$|M| = k$. A selector profile $\sigma^k = (\sigma_1^k, \sigma_2^k, \ldots, \sigma_n^k)$ is k-uniform if
$\sigma_i^k$ is a k-uniform selector for all $i \in \{1, 2, \ldots, n\}$. A memoryless strategy
profile $\sigma^{k,\infty} = (\sigma_1^{k,\infty}, \sigma_2^{k,\infty}, \ldots, \sigma_n^{k,\infty})$ is k-uniform if the selector profile
$\sigma^k = (\sigma_1^k, \sigma_2^k, \ldots, \sigma_n^k)$ corresponding to the strategy profile $\sigma^{k,\infty}$ is k-uniform. ∎
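As an illustration of Definition 7, a k-uniform selector can be built by counting how often each action is chosen by the pure selectors in the multiset. The sketch below assumes a pure selector is represented as a dictionary from states to actions; the name `k_uniform_selector` and this representation are choices made here, not notation from the paper.

```python
from collections import Counter
from fractions import Fraction

def k_uniform_selector(pure_selectors):
    # pure_selectors: a multiset (list, repetitions allowed) of k pure
    # selectors, each a dict mapping every state to a single action.
    k = len(pure_selectors)
    states = pure_selectors[0].keys()
    # At each state, an action's probability is (number of pure selectors
    # choosing it) / k, so every probability is a multiple of 1/k.
    return {
        s: {a: Fraction(c, k)
            for a, c in Counter(p[s] for p in pure_selectors).items()}
        for s in states
    }
```

Because every probability is a multiple of $1/k$, a k-uniform selector has a description of size polynomial in $k$, $|S|$ and the number of moves, which is what allows it to be guessed by a nondeterministic algorithm.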

Lemma 6 ([18]). Let $J$ be a matrix game (one-step game) with $n$ players in which
each player has at most $l$ moves. Let $\pi$ be a Nash equilibrium strategy/selector profile.
Then for every $\varepsilon > 0$ there exists, for every $k >$ […], a set of k-uniform
strategies/selectors such that the deviation from any pure strategy/selector with
positive support in $\pi$ is less than $\varepsilon$.

Definition 8 (Difference of two MDPs). Let $G_1$ and $G_2$ be two MDPs
defined on the same state space $S$. The difference of the two MDPs, denoted
$\mathrm{err}(G_1, G_2)$, is defined as

$$\mathrm{err}(G_1, G_2) = \sum_{s, s'} \left| p_1(s \mid s') - p_2(s \mid s') \right|$$

That is, $\mathrm{err}(G_1, G_2)$ is the sum of the differences of the probabilities of all the
edges of the MDPs. ∎
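A direct computation of this quantity might look as follows. The sketch assumes each law of motion is stored as a dictionary keyed by pairs `(s, s_prime)` holding $p(s \mid s')$, with missing edges read as probability 0; this encoding is an assumption made for the example.

```python
def err(p1, p2, states):
    # Sum, over all ordered pairs of states, of the absolute difference
    # between the two edge probabilities. Missing keys count as 0.
    return sum(
        abs(p1.get((s, t), 0.0) - p2.get((s, t), 0.0))
        for s in states
        for t in states
    )
```

Since the sum ranges over at most $|S|^2$ edges, a uniform bound on the per-edge difference yields a bound on $\mathrm{err}$ that is $|S|^2$ times larger.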

Lemma 7. Let $G^{\beta}$ be a discounted n-player stochastic reach-a-set game and
$\sigma^{\infty}$ be a memoryless Nash equilibrium profile with selector profile
$\sigma = (\sigma_1, \sigma_2, \ldots, \sigma_n)$. Then for every $\varepsilon > 0$, there exists, for every $k >$ […],
a k-uniform memoryless strategy profile $\sigma^{k,\infty}$ (with selector profile $\sigma^k$)
such that the following holds:

- for any player $i$, the MDPs $G_{\sigma_{-i}}$ and $G_{\sigma^k_{-i}}$ satisfy

  $\mathrm{err}(G_{\sigma_{-i}}, G_{\sigma^k_{-i}}) \leq \varepsilon$

Proof. It follows from Lemma 6 that there is a selector profile $\sigma^k$ such that for
any player $i$ the deviation (or error) of $\sigma_i^k$ from any pure strategy with positive
support in $\sigma_i$ at any state $s \in S$ is at most […]. Since there are $n$ players, for
any edge the difference in probabilities in $G_{\sigma_{-i}}$ and $G_{\sigma^k_{-i}}$ is at most […]. Since
there can be at most $|S|^2$ edges, the result follows. ∎

Lemma 8. Given an n-player discounted stochastic reach-a-set game $G^{\beta}$,
for every $\varepsilon > 0$ and every $k >$ […] there is a k-uniform memoryless
strategy profile $\sigma^{k,\infty} = (\sigma_1^{k,\infty}, \sigma_2^{k,\infty}, \ldots, \sigma_n^{k,\infty})$ such that $\sigma^{k,\infty}$ is an $\varepsilon$-Nash
equilibrium profile in the game $G^{\beta}$.

Proof. The result follows from Lemma 7 and Lipschitz continuity of the values of
MDPs with respect to err (Theorem 4.3.7, p. 185, Filar-Vrieze [10]). ∎

Lemma 9. Given an n-player stochastic reach-a-set game $G$, for every $\varepsilon > 0$
there exists, for every $k >$ […], a k-uniform memoryless strategy profile
$\sigma^{k,\infty}$ such that $\sigma^{k,\infty}$ is an $\varepsilon$-equilibrium profile for the game $G$.

Proof. Let $G^{\varepsilon/2}$ be a discounted version of $G$ with discount factor $\varepsilon/2$. Let
$\sigma^{k,\infty}$ be a k-uniform memoryless strategy profile such that $\sigma^{k,\infty}$ is an $\frac{\varepsilon}{2}$-Nash
equilibrium profile in $G^{\varepsilon/2}$. Existence of such a $\sigma^{k,\infty}$ follows from Lemma 8. Then
$\sigma^{k,\infty}$ is a k-uniform memoryless strategy profile which is an $\varepsilon$-equilibrium profile in the game $G$. ∎

Lemma 10. Given a constant $\varepsilon$, the value of an $\varepsilon$-equilibrium with a memoryless
strategy profile of an n-player stochastic reach-a-set game can be approximated
within $\varepsilon$ tolerance by an NP algorithm.

Proof. The NP algorithm guesses a k-uniform selector profile $\sigma^k$ for a k-uniform
memoryless $\varepsilon$-equilibrium strategy profile $\sigma^{k,\infty}$. It then verifies that the value of the
MDP $G_{\sigma^k_{-i}}$, for every state $s \in S$ and each player $i$, is within $\varepsilon$ tolerance of
the value of the Markov chain defined by $G_{\sigma^k}$. Since the computation
of values of an MDP can be achieved in polynomial time (using linear programming),
it follows that the approximation within $\varepsilon$ tolerance can be achieved
by an NP algorithm. ∎
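The verification step can be sketched on a toy instance. Note that the proof relies on linear programming to obtain a polynomial bound; the sketch below swaps in plain value iteration, which only approximates reachability values from below and carries no polynomial-time guarantee. The encoding (`mdp[s][a]` as a dictionary from successor states to probabilities, a Markov chain as an MDP with one action per state) and the names `max_reach` and `deviation_gain` are assumptions made for this example.

```python
def max_reach(mdp, target, sweeps=1000):
    # mdp[s][a] maps successor states to probabilities.
    # Returns an approximation of the maximal probability of reaching target.
    v = {s: (1.0 if s in target else 0.0) for s in mdp}
    for _ in range(sweeps):
        v = {
            s: 1.0 if s in target else max(
                sum(p * v[t] for t, p in dist.items())
                for dist in mdp[s].values()
            )
            for s in mdp
        }
    return v

def deviation_gain(mdp_i, chain, target, s):
    # How much player i can gain at s by deviating: best-response value in
    # the MDP (others fixed) minus the value under the guessed profile.
    return max_reach(mdp_i, target)[s] - max_reach(chain, target)[s]
```

The guessed profile passes the check for player $i$ at state $s$ when `deviation_gain` is at most $\varepsilon$; the algorithm repeats this check for every player and every state.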

Lemmas  5  and  10  yield  Theorem  2. 

5  Games  with  Turns 

An n-person stochastic game is turn-based if at each state at most one
player has a nontrivial choice of moves. Formally, we extend the action sets $A_i$
for $i = 1, \ldots, n$ to be state dependent, that is, for each state $s \in S$ there are
action sets $A_i^s$ for $i = 1, \ldots, n$, and we restrict the action sets so that for any
$s \in S$ there is at most one $i \in \{1, \ldots, n\}$ such that $|A_i^s| > 1$. A strategy $\pi_i$ for
player $i$ is pure if for every history $h = (s_0, a^0, s_1, \ldots, a^{k-1}, s_k)$ there is an action
$a \in A_i^{s_k}$ such that $\pi_i(h)(a) = 1$. In other words, a strategy is pure if for every
history the strategy chooses one action rather than a probability distribution
over the action set. A strategy profile is pure if all the strategies of the profile
are pure.

We consider payoff functions that are indicator functions of Borel sets (see, e.g., [15]
for definitions); that is, given a Borel set $B$, we consider the payoff function $\chi_B$
that assigns payoff 1 to a play that is in the set $B$, and 0 to a play that is
not in the set $B$. With abuse of notation, we identify the set $B$ with the payoff
function $\chi_B$. We consider turn-based games in which each player $i$ is given a Borel
payoff $B_i$. If $n = 2$, we call the game two-player. A two-player Borel game is
zero sum if the payoff set $B$ of one player is the complement of the payoff set of the
other player, that is, the players have strictly opposing objectives. Borel sets are
studied in descriptive set theory for their rich structural properties. A deep result
by Martin shows that two-player zero sum infinite stochastic games with Borel
payoffs have a value [23]. The proof constructs, for each real $v \in (0, 1]$, a zero sum
turn-based deterministic infinite-state game with Borel payoff such that a (pure)
winning strategy for player 1 in this game can be used to construct a (mixed)
winning strategy in the original game that assures player 1 a payoff of at least $v$.
From the determinacy of turn-based deterministic games with Borel payoffs [22],
the existence of value in zero sum stochastic games with Borel payoffs follows.
Moreover, the proof constructs $\varepsilon$-optimal mixed winning strategies. A careful
inspection of Martin's proof in the special case of turn-based probabilistic games
shows that the $\varepsilon$-optimal strategies of player 1 are pure. The mixed strategies
are derived from solving certain one-shot concurrent games at each round. In
our special case these one-shot games have pure winning strategies, since only
one player has a choice of moves.

Lemma 11 (Pure memory determinacy). For each $\varepsilon > 0$ there is a pure strategy
$\pi_1$ of player 1 such that for all strategies $\pi_2$ of player 2 the expected payoff of
player 1 under $(\pi_1, \pi_2)$ is at least $v - \varepsilon$.

Theorem 3. For each $\varepsilon > 0$ there exists an $\varepsilon$-Nash equilibrium in every n-player
turn-based probabilistic game with Borel payoffs.

Proof. Our construction is based on a general construction from repeated games.
The basic idea is that player $i$ plays optimal strategies in the zero sum game
against all other players, and any deviation by player $i$ is punished indefinitely
by the other players by playing $\varepsilon$-optimal spoiling strategies in the zero sum
game against player $i$ (see, e.g., [25,34]). Let player $i$ have the payoff set $B_i$,
for $i = 1, \ldots, n$. Consider the $n$ zero sum games played between $i$ and the team
$[n] \setminus \{i\}$, with the winning objective $B_i$ for $i$. By Lemma 11 there is a pure $\varepsilon$-optimal
strategy $\pi_i^i$ for player $i$ in this game, and a pure $\varepsilon$-optimal spoiling strategy for the
players $j \neq i$. This spoiling strategy induces a strategy $\pi_j^i$ for each player $j \neq i$.
Now consider the strategy $\tau_i$ for player $i$ defined as follows. Player $i$ plays the strategy
$\pi_i^i$ as long as all the other players $j$ play $\pi_j^j$, and switches to $\pi_i^j$ as soon as some
player $j$ deviates. Since the strategies are pure, any deviation is immediately
noted. The strategies $\tau_i$ for $i = 1, \ldots, n$ form an $\varepsilon$-Nash equilibrium. ∎
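The switching rule in this construction can be written as a function of the observed history. The following is only a sketch under assumptions made here: a history is a list of `(state, mover, action)` triples, with `mover` set to `None` at probabilistic states; `prescribed[j]` is the pure strategy player $j$ is expected to follow and `punish[j]` is the pure spoiling strategy used against player $j$, both given as functions of the history and the current state. None of these names come from the paper.

```python
def trigger_strategy(i, prescribed, punish):
    # Returns player i's strategy: follow prescribed[i] until some other
    # player is seen to deviate, then punish the first deviator forever.
    def tau(steps, state):
        for t, (s, mover, action) in enumerate(steps):
            if mover is None or mover == i:
                continue  # probabilistic move or player i's own move
            if prescribed[mover](steps[:t], s) != action:
                return punish[mover](steps, state)
        return prescribed[i](steps, state)
    return tau
```

The check is possible only because the prescribed strategies are pure: a single observed action that differs from the prescribed one is conclusive evidence of a deviation, whereas with mixed strategies a deviation could not be detected from one observation.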

Notice that the construction above for probabilistic Borel games guarantees
only $\varepsilon$-optimality. As a special case, using the determinacy result of [22], we get
that turn-based deterministic games (perfect information games) with payoffs
corresponding to Borel sets have Nash equilibria.

Corollary 2. Every turn-based deterministic game with payoffs corresponding
to Borel sets has a Nash equilibrium with a pure strategy profile.

A particularly interesting case of turn-based probabilistic games is when
each payoff function $B$ is an $\omega$-regular set [21]. Games with $\omega$-regular winning
conditions are used in the verification and control of (probabilistic) systems [1,
31,6]. In the special case of turn-based probabilistic games with parity winning
conditions, pure and memoryless optimal winning strategies exist for the two-player
zero-sum case [3]. Moreover, the pure memoryless optimal strategies can be
computed in NP $\cap$ coNP. Therefore we have the following.

Proposition 1. There exists a Nash equilibrium with a pure strategy profile in every
turn-based probabilistic game with parity payoff conditions. The value profile
of some Nash equilibrium can be computed in FNP.

6 Games with Two Players

In this section we consider a special case of two-player reach-a-set games,
namely two-player constant-sum games. For this special case we prove an
NP $\cap$ coNP bound for approximating the value of an $\varepsilon$-equilibrium profile, given a fixed
$\varepsilon$.

6.1  Two-player  constant-sum  reach-a-set-games 

We  now  define  a  two-player  constant-sum  reach-a-set-game. 

Definition 9 (Two-player constant-sum reach-a-set games). For a two-player
reach-a-set game $G$ let $\Pi$ denote the set of all $\varepsilon$-equilibrium strategy profiles
for $\varepsilon > 0$. We use the following notation:

$$v_1(s) = \sup_{\pi \in \Pi} v_1^{\pi}(s) \quad \text{and} \quad v_2(s) = \sup_{\pi \in \Pi} v_2^{\pi}(s)$$

The game is constant-sum if for all states $s \in S$ the following conditions hold:

- $v_1(s) + v_2(s) = 1$.

- for all $\pi \in \Pi$ we have $v_1^{\pi}(s) + v_2^{\pi}(s) = 1$. ∎

We now prove that computing the values $v_1(s)$ and $v_2(s)$ within an $\varepsilon$-tolerance,
given a fixed $\varepsilon$, can be achieved in NP $\cap$ coNP.

Lemma 12. Let $G$ be a two-player constant-sum reach-a-set game, $s$ an initial
state, and $v^1$ and $v^2$ be two values. For a fixed $\varepsilon$ it can be determined in
NP $\cap$ coNP whether

$$v_1(s) \geq v^1 - \varepsilon.$$

Proof. It follows from Lemma 9 that there is a k-uniform memoryless
$\varepsilon$-equilibrium profile $\sigma^{k,\infty} = (\sigma_1^{k,\infty}, \sigma_2^{k,\infty})$ with selector profile $\sigma^k = (\sigma_1^k, \sigma_2^k)$. Since a
two-player constant-sum reach-a-set game is a special case of an n-player stochastic
reach-a-set game, it follows from Lemma 10 that the unique equilibrium value of a
two-player constant-sum reach-a-set game can be approximated by an algorithm
in NP.

To prove that there is a coNP algorithm, consider the case when $v_1(s) < v^1 - \varepsilon$.
The coNP algorithm guesses the k-uniform selector $\sigma_2^k$ for player 2 and verifies
that the value of player 1 at the state $s$ in the MDP $G_{\sigma_2^k}$ is less than $v^1 - \varepsilon$.
Since the value of an MDP at any state can be computed in polynomial time
(using linear programming), the required result follows. ∎

Theorem 4 (Two-player constant-sum reach-a-set games). Given a fixed
$\varepsilon$, the value of an $\varepsilon$-equilibrium profile of two-player stochastic constant-sum
reach-a-set games can be computed in NP $\cap$ coNP.

6.2 Concurrent Reachability Games

We now show that the values of two-player concurrent reachability games (zero-sum
reach-a-set games) can be approximated within $\varepsilon$ tolerance in NP $\cap$ coNP.
The previous best known algorithm was exponential [8].

A two-player concurrent reachability game [7] $G$ is a two-player stochastic
game with $R_1 \subseteq S$ as a target set of states for player 1. Given a random history
$h = (s_0, a^0, s_1, \ldots)$, player 1 gets payoff 1 if the history contains a state in
$R_1$; otherwise player 2 gets payoff 1. In other words, player 1 plays a reachability
game with target set $R_1$ and player 2 plays a safety game with its safe set of
states $S_2$, where $S_2 = S \setminus R_1$. Let $\Pi_1$ and $\Pi_2$ be the sets of all strategies of player 1
and player 2 respectively. Then for any state $s \in S$ we use the following notation,
where $v_i^{\pi_1,\pi_2}(s)$ is the expected payoff of player $i$ from $s$ under the profile $(\pi_1, \pi_2)$:

$$v_1(s) = \sup_{\pi_1 \in \Pi_1} \inf_{\pi_2 \in \Pi_2} v_1^{\pi_1,\pi_2}(s)$$

$$v_2(s) = \sup_{\pi_2 \in \Pi_2} \inf_{\pi_1 \in \Pi_1} v_2^{\pi_1,\pi_2}(s)$$

It follows from determinacy of Blackwell games [23] that for all states $s \in S$
we have $v_1(s) + v_2(s) = 1$. Let $W_2 = \{s \mid v_2(s) = 1\}$ and $W_1 = \{s \mid v_1(s) = 1\}$.
We prove that the concurrent reachability game can be reduced to a two-player
reach-a-set game $G_R$ with $W_1$ and $W_2$ as the target sets of states for player 1
and player 2, respectively, in which all the states in $W_1$ and $W_2$ are absorbing
states (sink states). In the proof below we write $\mathit{reach}^{\pi_1,\pi_2}(W_2)(s)$ and
$\mathit{reach}^{\pi_1,\pi_2}(W_1)(s)$ for the probability that a play from $s$ under the profile
$(\pi_1, \pi_2)$ reaches $W_2$ and $W_1$, respectively, and

$$\mathit{reach}(W_2)(s) = \sup_{\pi_2 \in \Pi_2} \inf_{\pi_1 \in \Pi_1} \mathit{reach}^{\pi_1,\pi_2}(W_2)(s)$$

Lemma 13. Let $G$ be a concurrent reachability game and $G_R$ be the two-player
reach-a-set game with the target sets for player 1 and player 2 being
$R_1 = W_1$ and $R_2 = W_2$, respectively. Also, every state in $W_1 \cup W_2$ is an absorbing
state: once the process of states reaches a state in $W_1$ or $W_2$ it remains
there forever. Then, for all states $s \in S$ we have $v_2(s) = \mathit{reach}(W_2)(s)$.

Proof. From every state $s \in W_2$ there is a strategy $\pi_2^*$ with which player 2 can
stay in its safety set $S \setminus R_1$ with probability 1. Hence, combining a strategy to
reach the set $W_2$ with the strategy $\pi_2^*$, we get that $v_2(s) \geq \mathit{reach}(W_2)(s)$.

Suppose $v_2(s) > \mathit{reach}(W_2)(s)$. It follows from [8] that player 2 has an optimal
memoryless strategy in the concurrent reachability game $G$. Let $\pi_2$ be an
optimal memoryless strategy for player 2 in the concurrent reachability game
$G$. Fixing the memoryless optimal strategy $\pi_2$ for player 2 in the game $G_R$ we
get an MDP $G_{R,\pi_2}$ where at each state player 2 plays according to the strategy
$\pi_2$. Let $\pi_1$ be an optimal memoryless strategy of player 1 against the strategy $\pi_2$ in
the game $G$. The game $G_{\pi_1,\pi_2}$ is a Markov chain. Let $C$ be any terminal
strongly connected component of the Markov chain $G_{\pi_1,\pi_2}$. If $C \cap R_1 \neq \emptyset$ then
from every state $s \in C$ player 1 wins with probability 1, and if $C \cap R_1 = \emptyset$ then
from every state $s \in C$ player 2 wins with probability 1. Since $\pi_2$ is an optimal
strategy and $\pi_1$ is an optimal strategy against $\pi_2$, every terminal
strongly connected component is a subset of $W_1$ or $W_2$. Hence in the Markov
chain $G_{\pi_1,\pi_2}$ we have

$$\mathit{reach}^{\pi_1,\pi_2}(W_1)(s) + \mathit{reach}^{\pi_1,\pi_2}(W_2)(s) = 1$$

Now if $v_2(s) > \mathit{reach}^{\pi_1,\pi_2}(W_2)(s)$ then we have $\mathit{reach}^{\pi_1,\pi_2}(W_1)(s) > 1 - v_2(s) = v_1(s)$.
Since $\pi_2$ is an optimal strategy for player 2 and $\pi_1$ is an optimal
strategy against it, we must have $\mathit{reach}^{\pi_1,\pi_2}(W_1)(s) = v_1(s)$. This is a
contradiction. Therefore we have $v_2(s) \leq \mathit{reach}(W_2)(s)$, and hence
$v_2(s) = \mathit{reach}(W_2)(s)$. ∎

Lemma 14. Given a fixed $\varepsilon$, the values $v_1(s)$ and $v_2(s)$ of a concurrent
reachability game can be approximated within $\varepsilon$ tolerance in NP $\cap$ coNP.

Proof. It follows from Lemma 13 that a concurrent reachability game can be
reduced to a two-player stochastic reach-a-set game with the target sets for player 1
and player 2 being $W_1$ and $W_2$ respectively. It follows from the result of de Alfaro
and Henzinger [6] that the sets $W_1$ and $W_2$ can be computed in polynomial time.
It follows from the result of Martin on determinacy of Blackwell games [23] that
this game is a constant-sum two-player stochastic reach-a-set game. The result
then follows from Lemma 12. ∎

Corollary 3 (Two-player concurrent reachability games). The value of a
two-player concurrent reachability game can be approximated within $\varepsilon$-tolerance
in NP $\cap$ coNP, given a fixed $\varepsilon$.

The natural question at this point is whether there is a polynomial time
algorithm for concurrent zero sum reachability games. Since simple stochastic games
[4] can be easily reduced to concurrent reachability games, a polynomial time
algorithm for this problem would imply a polynomial time algorithm for simple
stochastic games and mean payoff games [37]. These have been long-standing
open problems in the area.